Generalized Additive Models in Fraud Detection

Data Science Capstone Project

Grace Allen, Kesi Allen, Sonya Melton, Pingping Zhou

2025-11-21

Introduction

What are generalized additive models?

  • Not your typical straight-line regression — GAMs let patterns curve naturally

  • Great at uncovering hidden trends in messy real-world data

  • Each feature gets its own shape, showing where risk rises or falls

  • Makes the model’s behavior easy to explain to non-technical teams

  • Perfect for fraud detection, where small pattern changes matter

Brief History of GAMs

Generalized Additive Models were introduced in the late 1980s as a way to add flexibility to traditional regression models. Trevor Hastie and Robert Tibshirani developed the framework to allow each predictor in a model to follow its own smooth pattern rather than forcing everything into a straight line. Through the 1990s and early 2000s, the approach grew in popularity in fields that needed interpretable models, including public health, ecology, and social sciences.

A major step forward came with the development of the mgcv package in R, created by Simon Wood. His work added modern smoothing techniques, automatic penalty selection, and faster computation, making GAMs practical for large and noisy datasets. Today, GAMs are widely used in finance, fraud detection, risk scoring, and other areas where organizations need both predictive accuracy and clear explanations.

GAMs in Action: Real-World Uses + Our Study

GAMs help uncover nonlinear relationships and subtle patterns across diverse domains:

  • Financial Analytics: Detecting anomalies and potential fraud in transaction data

  • Banking & Insurance: Modeling customer and policy risk scores

  • Environmental Science: Forecasting trends in environmental and climate research

  • Public Health: Understanding health outcomes and public health patterns

Our Project: GAMs for Fraud Detection

  • Toolset: RStudio + mgcv package

  • Dataset: Kaggle’s Fraud Detection Transactions (Ashar, 2024)

  • Purpose: Identify predictive variables linked to fraudulent activity

  • Context: Synthetic but realistic data for controlled testing

Here’s how we used GAMs to explore patterns in the fraud dataset.

Methods

GAM Modeling Overview

  • GAMs extend traditional regression

  • Capture nonlinear predictor-response relationships

  • Use spline-based smooth functions

  • Combine continuous + categorical predictors

  • Fit with mgcv (penalized splines + GCV)

  • Model outputs interpretable smooth effects

  • Goal: Estimate probability of fraud
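The "penalized splines" bullet can be written out as a short formula. This is a sketch of the usual penalized regression spline formulation, not something stated in these slides; the symbols \(b_{jk}\), \(\beta_{jk}\), \(K_j\), \(\lambda_j\), \(S_j\), and \(\ell\) are notation introduced here for illustration.

\[ s_j(x) = \sum_{k=1}^{K_j} \beta_{jk}\, b_{jk}(x), \qquad \hat{\beta} = \arg\max_{\beta} \left\{ \ell(\beta) - \tfrac{1}{2} \sum_{j} \lambda_j\, \beta_j^{\top} S_j\, \beta_j \right\} \]

Each smooth is a weighted sum of basis functions \(b_{jk}\), \(\ell\) is the log-likelihood, and each penalty matrix \(S_j\) measures the wiggliness of smooth \(j\). A larger \(\lambda_j\) gives a smoother curve; GCV is one criterion for choosing the \(\lambda_j\) values.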

Modeling Workflow Steps

GAM Equation

\[ g(\mu) = \alpha + s_1(X_1) + s_2(X_2) + \dots + s_p(X_p) \]

  • \(g\) = link function (logit for the binary fraud outcome)

  • Smooth functions capture nonlinear effects

  • Additive contributions from each predictor

  • Balances flexibility + interpretability
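To make the additive structure concrete, here is a minimal Python sketch of how per-predictor contributions combine on the log-odds scale and are mapped to a probability. It is not the project's fitted model: the intercept `ALPHA` and the smooth shapes `s_amount` and `s_risk` are hypothetical stand-ins, whereas a real GAM estimates these curves from data.

```python
import math

# Hypothetical intercept on the log-odds scale (not a fitted value).
ALPHA = -2.0


def inv_logit(eta):
    """Inverse of the logit link g: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def s_amount(amount):
    """Hypothetical smooth effect of transaction amount (log-odds units)."""
    return 0.3 * math.log1p(amount / 100.0)


def s_risk(score):
    """Hypothetical smooth effect of a 0-1 risk score (log-odds units)."""
    return 4.0 * (score - 0.5) ** 3


def fraud_probability(amount, risk_score):
    """Additive predictor: g(mu) = alpha + s_1(X_1) + s_2(X_2)."""
    eta = ALPHA + s_amount(amount) + s_risk(risk_score)
    return inv_logit(eta)


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        p = fraud_probability(100.0, score)
        print(f"risk_score={score:.1f} -> P(fraud)={p:.3f}")
```

Because the contributions add on the log-odds scale, each curve can be plotted and read separately, which is what makes the smooth effects interpretable.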

GAM Assumptions (Fraud Context)

  • Logit link approximates fraud probability

  • Additive and independent predictor effects

  • Smooth, gradual functional relationships

  • Binomial response distribution

  • Independent observations

  • Low predictor multicollinearity

  • Penalization prevents overfitting

Why We Chose GAMs For Fraud Detection

  • Captures nonlinear fraud patterns

  • Handles rare, imbalanced outcomes

  • Produces interpretable smooth risk curves

  • Supports regulatory transparency

  • Balances accuracy + interpretability

  • Strong literature support for fraud analytics

  • Scalable through mgcv’s automated smoothing

Practical Advantages & Relevance to Real-World Analytics

  • Supports investigative decision-making

  • Shows monotonic or nonlinear risk curves

  • Can benchmark or surrogate black-box models

  • High recall for suspicious transactions

  • Useful for auditors, fraud teams, analysts

  • Aligns with both operational and compliance needs

Analysis and Results

Data Exploration and Visualization

Dataset Description

🔹 What It Is

A synthetic dataset built to mimic real financial transactions

Privacy‑safe: no real people’s data used

Hosted on Kaggle

🔹 Why We Use It

Train fraud detection models for binary classification tasks

Spot fraud: each transaction labeled as fraud (1) or not fraud (0)

🔹 What Makes It Special

Realistic fraud patterns:

Groups of fraudulent transactions

Subtle, hard‑to‑notice anomalies

Odd user behaviors

Large & diverse records: balances normal vs. rare fraud cases → addresses class imbalance.

Key Characteristics

🔹 What’s Inside

50,000 Rows: A good amount of data to work with.

Two Labels: Every transaction is marked as either 1 = Fraud or 0 = Not Fraud.

🔹 Data Features: 21 features across three categories

Numbers: Like transaction amounts, risk scores, account balances.

Categories: Transaction types (payment, transfer, withdrawal), device types, merchant categories.

Time Data: When transactions happened (time, day) and their sequence.

🔹 Label Distribution

Class Imbalance: Fraudulent transactions are a small percentage, reflecting real-world scenarios.

Behavioral Realism: Includes unusual spending, behavioral signals, and high-risk profiles.

Modeling flexibility: supports interpretable (GAMs, logistic regression) or high-performance (XGBoost) approaches

Distribution of Variables

Table 1 – Transaction Types and Counts

Type              Count
POS               12,549
Online            12,546
ATM Withdrawal    12,453
Bank Transfer     12,452

Table 2 – Device Types and Counts

Device            Count
Tablet            16,779
Mobile            16,640
Laptop            16,581

Table 3 – Merchant Categories and Counts

Merchant_Category Count
Clothing          10,033
Groceries         10,019
Travel            10,015
Restaurants        9,976
Electronics        9,957

Non-linearity Check

Modeling and Results

Assumptions

GAM Analysis for Numeric Variables

GAM Analysis for Categorical Variables

Table 4. GAM Categorical Coefficients

term                           estimate  std.error  statistic  p.value     OR  OR_low  OR_high
Transaction_TypeBank Transfer    -0.018      0.027     -0.677    0.498  0.982   0.931    1.035
Transaction_TypeOnline           -0.016      0.027     -0.604    0.546  0.984   0.933    1.037
Transaction_TypePOS              -0.030      0.027     -1.092    0.275  0.971   0.921    1.024
Merchant_CategoryElectronics      0.010      0.030      0.340    0.734  1.010   0.952    1.072
Merchant_CategoryGroceries        0.019      0.030      0.626    0.531  1.019   0.960    1.082
Merchant_CategoryRestaurants      0.042      0.030      1.395    0.163  1.043   0.983    1.107
Merchant_CategoryTravel           0.028      0.030      0.915    0.360  1.028   0.969    1.091
Device_TypeMobile                -0.004      0.024     -0.161    0.872  0.996   0.951    1.043
Device_TypeTablet                 0.028      0.023      1.182    0.237  1.028   0.982    1.076
Card_TypeDiscover                 0.006      0.027      0.202    0.840  1.006   0.953    1.061
Card_TypeMastercard              -0.026      0.027     -0.962    0.336  0.974   0.924    1.027
Card_TypeVisa                    -0.024      0.027     -0.877    0.381  0.977   0.926    1.030
Authentication_MethodOTP          0.020      0.027      0.745    0.456  1.020   0.968    1.076
Authentication_MethodPassword     0.012      0.027      0.445    0.656  1.012   0.960    1.067
Authentication_MethodPIN         -0.022      0.027     -0.807    0.420  0.978   0.928    1.032
Is_Weekend                        0.000      0.021     -0.012    0.991  1.000   0.960    1.042
IP_Address_Flag                   0.029      0.044      0.675    0.500  1.030   0.945    1.122
Previous_Fraudulent_Activity     -0.005      0.032     -0.168    0.867  0.995   0.934    1.059
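The OR columns follow from the coefficient columns by exponentiation. The sketch below assumes that OR_low and OR_high are 95% Wald intervals, exp(estimate ± 1.96 × std.error); the source does not state this, but the table values are consistent with it. The helper name `odds_ratio_with_ci` is hypothetical.

```python
import math


def odds_ratio_with_ci(estimate, std_error, z=1.96):
    """Return (OR, lower, upper) for a log-odds coefficient.

    Assumes a Wald interval on the log-odds scale; z=1.96 gives ~95% coverage.
    """
    return (
        math.exp(estimate),
        math.exp(estimate - z * std_error),
        math.exp(estimate + z * std_error),
    )


# Rounded inputs copied from the Transaction_TypeBank Transfer row of Table 4.
or_, low, high = odds_ratio_with_ci(-0.018, 0.027)
print(f"OR={or_:.3f}, CI=({low:.3f}, {high:.3f})")

# An interval that contains 1 is consistent with no detectable effect.
print("interval contains 1:", low < 1.0 < high)
```

Because the inputs are the rounded table values, the result can differ from the table in the third decimal place. Every interval in Table 4 contains 1, in line with the large p-values.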

GAM Model For the Key Predictor

GAM Equation for Key Predictor

To examine the association between Risk_Score and fraud, we fit a generalized additive model (GAM) using a logit link function. Following the general GAM structure:

\[ g(\mu) = \alpha + s_1(X_1) + s_2(X_2) + \dots + s_p(X_p) \]

our model simplifies to a single predictor:

\[ \text{logit}(\Pr(\text{Fraud} = 1)) = \alpha + s(\text{Risk\_Score}) \]

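where \(\alpha = 1.9109\) is the intercept. Because mgcv centers each smooth term (a sum-to-zero identifiability constraint), the intercept is the log-odds of fraud at a point where the smooth contributes zero, which is roughly the average log-odds across the data, rather than the log-odds at Risk_Score = 0. The smooth term \(s(\text{Risk\_Score})\) captures the nonlinear relationship between Risk_Score and fraud probability. Its effective degrees of freedom are approximately 9, which allows considerable flexibility in modeling nonlinear changes. The smooth term is highly significant, indicating a strong association with fraud.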
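As a worked conversion, the inverse logit maps the reported intercept to a probability. This assumes the smooth term contributes zero, that is \(s(\text{Risk\_Score}) = 0\):

\[ \Pr(\text{Fraud} = 1) = \frac{1}{1 + e^{-\alpha}} = \frac{1}{1 + e^{-1.9109}} \approx 0.871 \]

Comparing this baseline with the observed fraud rate in the data is a quick sanity check on the sign and scale of the intercept.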